Learning Convolutional Action Primitives from Multimodal Timeseries Data
Abstract
Fine-grained action recognition is important for many applications, such as human-robot interaction, automated skill assessment, and surveillance. The goal is to predict which action is occurring at any point in a timeseries sequence. While recent work has improved recognition performance in robotics applications, these methods often require hand-crafted features or extensive domain knowledge. Furthermore, these methods tend to model actions using pointwise estimates of individual frames or statistics over collections of frames. In this work we develop a notion of an action primitive that models how generic features transition over the course of an action. Our Latent Convolutional Skip Chain Conditional Random Field (LC-SC-CRF) model learns a set of interpretable and composable action primitives. We apply our model to the cooking and robotic surgery domains using the University of Dundee 50 Salads dataset and the JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS). Each consists of multimodal timeseries data including video, robot kinematics, and/or kitchen-object accelerations. Our recognition performance on 50 Salads and JIGSAWS is 18.0% and 5.3% higher, respectively, than the state of the art. Our model has the advantage of being more general than many recent approaches and performs well without requiring hand-crafted features or intricate domain knowledge. Upon publication we will release our LC-SC-CRF code and the features used on the two datasets.
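The abstract does not spell out the model's mechanics, but the core idea of a convolutional action primitive can be sketched as follows: each action class is represented by a learned temporal filter, the filter bank is convolved with the multimodal feature sequence, and the per-frame primitive scores are combined with pairwise skip-chain transition scores during decoding. Everything below is an illustrative assumption based on that reading, not the authors' released LC-SC-CRF code; the function names, shapes, and the greedy decoder (the paper would use exact dynamic-programming inference) are all placeholders.

```python
import numpy as np

def primitive_scores(X, filters):
    """Score each action primitive at every frame.

    X       : (T, D) multimodal feature sequence (e.g. kinematics + video).
    filters : (C, d, D) one learned temporal filter per action class;
              d is the primitive duration in frames.
    Returns : (T, C) score for starting each primitive at each frame.
    """
    T, D = X.shape
    C, d, _ = filters.shape
    scores = np.full((T, C), -np.inf)
    for t in range(T - d + 1):
        window = X[t:t + d]  # (d, D) local feature window
        # Convolution at time t = dot product of the window with each filter.
        scores[t] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    return scores

def greedy_decode(scores, transitions, d):
    """Greedy skip-chain decoding: pick the best primitive every d frames,
    adding a pairwise transition score between consecutive primitives.
    (Used here only to keep the sketch short; exact inference would use
    dynamic programming over the skip chain.)"""
    T, C = scores.shape
    labels, prev = [], None
    for t in range(0, T - d + 1, d):
        s = scores[t].copy()
        if prev is not None:
            s += transitions[prev]  # skip-chain pairwise term
        prev = int(np.argmax(s))
        labels.extend([prev] * d)
    return labels

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))          # 100 frames, 16-dim features
filters = rng.normal(size=(5, 10, 16))  # 5 primitives, 10 frames long
transitions = rng.normal(size=(5, 5))   # class-to-class transition scores
print(greedy_decode(primitive_scores(X, filters), transitions, d=10)[:20])
```

In this reading, each filter is a template of how the features evolve over the duration of one primitive, which is what would make the learned primitives interpretable and composable rather than pointwise frame classifiers.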
Similar Papers
Multimodal Skipgram Using Convolutional Pseudowords
This work studies the representational mapping across multimodal data such that, given a piece of raw data in one modality, the corresponding semantic description in terms of raw data in another modality is immediately obtained. Such a representational mapping can be found in a wide spectrum of real-world applications including image/video retrieval, object recognition, action/behavior re...
Deep Convolutional Neural Network Textual Features and Multiple Kernel Learning for Utterance-level Multimodal Sentiment Analysis
We present a novel way of extracting features from short texts, based on the activation values of an inner layer of a deep convolutional neural network. We use the extracted features in multimodal sentiment analysis of short video clips representing one sentence each. We use the combined feature vectors of textual, visual, and audio modalities to train a classifier based on multiple kernel lear...
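As a minimal sketch of the feature-extraction step this snippet describes, the code below pulls the activations of an inner layer out of a toy PyTorch text CNN with a forward hook and concatenates them with stand-in visual and audio features. The architecture, layer choice, and all dimensions are illustrative assumptions, and the multiple-kernel-learning classifier itself is omitted.

```python
import torch
import torch.nn as nn

# Toy text CNN; the paper extracts features from an inner layer of a deep
# convolutional network trained on text -- this tiny stand-in is only for
# illustrating the extraction mechanism, not the authors' network.
text_cnn = nn.Sequential(
    nn.Conv1d(in_channels=300, out_channels=64, kernel_size=3),  # word vectors in
    nn.ReLU(),
    nn.AdaptiveMaxPool1d(1),  # pool over the sentence
    nn.Flatten(),             # -> (batch, 64) inner-layer activations
    nn.Linear(64, 2),         # task head, discarded at extraction time
)

activations = {}
def save_activation(module, inputs, output):
    activations["penultimate"] = output.detach()

# Hook the layer before the classifier head and read off its activations.
text_cnn[3].register_forward_hook(save_activation)

sentence = torch.randn(1, 300, 20)      # 1 sentence, 20 word vectors
text_cnn(sentence)
text_feat = activations["penultimate"]  # (1, 64) textual features

# Fuse with (randomly generated stand-in) visual and audio features; the
# paper then trains a multiple-kernel-learning classifier on this vector.
visual_feat = torch.randn(1, 128)
audio_feat = torch.randn(1, 32)
utterance_vector = torch.cat([text_feat, visual_feat, audio_feat], dim=1)
print(utterance_vector.shape)           # torch.Size([1, 224])
```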
Multimodal Deep Learning for Cervical Dysplasia Diagnosis
To improve the diagnostic accuracy of cervical dysplasia, it is important to fuse multimodal information collected during a patient’s screening visit. However, current multimodal frameworks suffer from low sensitivity at high specificity levels, due to their limitations in learning correlations among highly heterogeneous modalities. In this paper, we design a deep learning framework for cervica...
Learning to Extract Semantic Structure from Documents Using Multimodal Fully Convolutional Neural Network
We present an end-to-end, multimodal, fully convolutional network for extracting semantic structures from document images. We consider document semantic structure extraction as a pixel-wise segmentation task, and propose a unified model that classifies pixels based not only on their visual appearance, as in the traditional page segmentation task, but also on the content of underlying text. More...
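Below is a minimal sketch of the kind of unified pixel-wise model this snippet describes, assuming early fusion of image channels with a per-pixel text-embedding map; the layer sizes, the fusion point, and the class count are invented for illustration and are not the published architecture.

```python
import torch
import torch.nn as nn

class TinyMultimodalFCN(nn.Module):
    """Illustrative fully convolutional segmenter that classifies each
    pixel from visual appearance plus a per-pixel text-embedding map
    (e.g. word embeddings rendered at the locations of recognized text).
    All layer sizes are placeholders."""
    def __init__(self, text_dim=16, n_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + text_dim, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, n_classes, kernel_size=1),  # per-pixel logits
        )

    def forward(self, image, text_map):
        # Early fusion: stack text-embedding channels onto the RGB channels.
        return self.net(torch.cat([image, text_map], dim=1))

model = TinyMultimodalFCN()
image = torch.randn(1, 3, 64, 64)      # document page crop
text_map = torch.randn(1, 16, 64, 64)  # per-pixel text embeddings
logits = model(image, text_map)
print(logits.shape)                    # torch.Size([1, 4, 64, 64])
```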
A Hybrid Method for Traffic Flow Forecasting Using Multimodal Deep Learning
Traffic flow forecasting has been regarded as a key problem in intelligent transport systems. In this work, we propose a hybrid multimodal deep learning method for short-term traffic flow forecasting, which jointly learns the spatial-temporal correlation features and the interdependence of multi-modality traffic data through a multimodal deep learning architecture. According to the highly nonlinear charac...